Are Malignant Cancer Cells Really Bigger?

10 December 2025

Ryan Mooney

The goal

Determine whether a tumor's diagnosis (benign or malignant) is associated with the average size of its cells.

The method to answer this question: permutation test!


The data

The data set is from William Wolberg et al. (1993), described in Biomedical Image Processing and Biomedical Visualization, and contains data on 568 breast cancer tumor samples. It was downloaded from the UC Irvine Machine Learning Repository.

What does the data look like?

head(tumor_data)
# A tibble: 6 × 32
        id diagnosis radius_mean texture_mean perimeter_mean area_mean
     <dbl> <chr>           <dbl>        <dbl>          <dbl>     <dbl>
1   842517 M                20.6         17.8          133.      1326 
2 84300903 M                19.7         21.2          130       1203 
3 84348301 M                11.4         20.4           77.6      386.
4 84358402 M                20.3         14.3          135.      1297 
5   843786 M                12.4         15.7           82.6      477.
6   844359 M                18.2         20.0          120.      1040 
# ℹ 26 more variables: smoothness_mean <dbl>, compactness_mean <dbl>,
#   concavity_mean <dbl>, concave_points_mean <dbl>, symmetry_mean <dbl>,
#   fractal_dimension_mean <dbl>, radius_se <dbl>, texture_se <dbl>,
#   perimeter_se <dbl>, area_se <dbl>, smoothness_se <dbl>,
#   compactness_se <dbl>, concavity_se <dbl>, concave_points_se <dbl>,
#   symmetry_se <dbl>, fractal_dimension_se <dbl>, radius_worst <dbl>,
#   texture_worst <dbl>, perimeter_worst <dbl>, area_worst <dbl>, …

Benign vs Malignant Tumors (National Cancer Institute, 2001)

Major Differences: Circularity, nucleation, rigidity, size?

Hypotheses

The null hypothesis: benign tumors and malignant tumors have cells of the same average area.

The alternative hypothesis: malignant tumors have larger average cell areas.

The Variables of Interest and Test Statistic

The variables:

- diagnosis: whether the tumor is malignant or benign, designated by an ‘M’ or a ‘B’ in the diagnosis column
- mean tumor cell area: a calculated value for each sample in the area_mean column

The test statistic is the difference in mean cell area between the malignant and benign tumor samples (malignant minus benign).

Let’s have a look at the observed group means behind the test statistic

tumor_data |> 
  group_by(diagnosis) |> 
  summarize(ave_area = mean(area_mean))
# A tibble: 2 × 2
  diagnosis ave_area
  <chr>        <dbl>
1 B             463.
2 M             978.

So, the mean area of malignant tumor cells is larger than that of benign tumor cells, by about 515. However, could a gap that size arise by chance alone? Off to the permutation test!
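To make the direction of the subtraction explicit, here is a minimal base R sketch that reduces two group means to a single test statistic. The `toy` data frame and its values are made up for illustration and are not taken from the tumor data; only the column names mirror the original.

```r
# Toy stand-in for tumor_data; the values are invented for illustration only.
toy <- data.frame(
  diagnosis = c("B", "B", "B", "M", "M", "M"),
  area_mean = c(400, 450, 500, 900, 1000, 1100)
)

# Group means, indexed by diagnosis label.
group_means <- tapply(toy$area_mean, toy$diagnosis, mean)

# Test statistic: malignant mean minus benign mean.
obs_diff <- group_means[["M"]] - group_means[["B"]]
obs_diff
```

Indexing by label rather than by position means the sign of the statistic does not depend on how the groups happen to be sorted.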

The Permutation Test

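set.seed(47)
# Shuffle area_mean across samples once, then compute the difference in
# group means for both the observed and the shuffled areas.
perm_data <- function(rep, data) {
  data |>
    select(diagnosis, area_mean) |>
    # Permute the areas while leaving the diagnosis labels in place
    mutate(area_perm = sample(area_mean, replace = FALSE)) |>
    group_by(diagnosis) |>
    summarize(
      obs_mean  = mean(area_mean),
      perm_mean = mean(area_perm)) |>
    # Rows are ordered B, M, so diff() returns M minus B
    summarize(
      obs_mean_diff  = diff(obs_mean),
      perm_mean_diff = diff(perm_mean),
      rep = rep
    )
}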

# Run 1,000 permutations and keep the stacked results for later steps
perm_stats <- map(1:1000, perm_data, data = tumor_data) |> 
  list_rbind()
perm_stats
# A tibble: 1,000 × 3
   obs_mean_diff perm_mean_diff   rep
           <dbl>          <dbl> <int>
 1          515.         30.9       1
 2          515.        -52.8       2
 3          515.         63.0       3
 4          515.         24.5       4
 5          515.         -4.64      5
 6          515.          0.188     6
 7          515.         -6.70      7
 8          515.          9.85      8
 9          515.         45.6       9
10          515.        -55.7      10
# ℹ 990 more rows
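The same idea can be sketched without the tidyverse, which may make the logic of a single permutation easier to see. This is an illustrative base R version on simulated data, not the analysis above: the `toy` data frame, its group sizes, and the `mean_diff` helper are all assumptions introduced here.

```r
set.seed(47)

# Hypothetical toy data: 20 benign and 20 malignant samples (simulated values).
toy <- data.frame(
  diagnosis = rep(c("B", "M"), each = 20),
  area_mean = c(rnorm(20, mean = 450, sd = 100),
                rnorm(20, mean = 1000, sd = 300))
)

# Difference in group means (M minus B) for a given vector of areas.
mean_diff <- function(area, diagnosis) {
  mean(area[diagnosis == "M"]) - mean(area[diagnosis == "B"])
}

obs_diff <- mean_diff(toy$area_mean, toy$diagnosis)

# One permutation = shuffle the areas, keep the labels, recompute the statistic.
perm_diffs <- replicate(
  1000,
  mean_diff(sample(toy$area_mean), toy$diagnosis)
)

summary(perm_diffs)
```

Because shuffling breaks any link between area and diagnosis, the permuted differences should scatter around zero, which is the pattern visible in the perm_mean_diff column.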

Visualizing the null distribution
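One way to draw the null distribution is a histogram of the permuted differences with the observed difference marked on the same axis. The sketch below uses base R graphics and a simulated stand-in for the permutation results, so the numbers are placeholders rather than the values from this analysis.

```r
set.seed(47)

# Hypothetical stand-in for the permutation results: simulated null
# differences plus an observed difference, using the same column names.
perm_stats <- data.frame(
  obs_mean_diff  = 515,
  perm_mean_diff = rnorm(1000, mean = 0, sd = 30)
)

# Widen the x-axis so the observed statistic is visible next to the null.
x_range <- range(c(perm_stats$perm_mean_diff, perm_stats$obs_mean_diff))

h <- hist(
  perm_stats$perm_mean_diff,
  breaks = 30,
  xlim   = x_range,
  main   = "Null distribution of the difference in mean area",
  xlab   = "Permuted difference in means (M - B)"
)

# Mark the observed difference with a vertical line.
abline(v = perm_stats$obs_mean_diff[1], col = "red", lwd = 2)
```

Extending the axis matters here: with default limits the observed value would fall off the plot, hiding how far it sits from the permuted differences.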

What’s the p-value?

perm_stats |>
  summarize(p_val = mean(perm_mean_diff > obs_mean_diff))
# A tibble: 1 × 1
  p_val
  <dbl>
1     0

Conclusions

The permutation test gave an empirical p-value of 0, which with 1,000 permutations means p < 0.001 rather than exactly zero.
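An empirical p-value of exactly 0 reflects the finite number of permutations, not a true probability of zero. A common convention is to add 1 to both the count and the number of permutations, treating the observed statistic as one more draw from the null. The `perm_p_value` helper below is a hypothetical name used only for this sketch.

```r
# Empirical one-sided p-value with the "+1" correction, which counts the
# observed statistic as one of the permutations and so never returns 0.
perm_p_value <- function(perm_diffs, obs_diff) {
  (1 + sum(perm_diffs >= obs_diff)) / (1 + length(perm_diffs))
}

# Hypothetical case mirroring the result here: none of 1,000 permuted
# differences reaches the observed one.
p_corrected <- perm_p_value(perm_diffs = rep(0, 1000), obs_diff = 515)
p_corrected
```

With no permuted difference reaching the observed one, this gives 1/1001, the smallest value 1,000 permutations can support.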

The observed difference in mean cell area between malignant and benign breast tumors was 515.479.

A difference at least this large did not occur once in 1,000 random permutations.

The extremely small p-value provides very strong evidence against the null hypothesis.

Therefore, the results suggest that malignant breast tumors have, on average, larger cells than benign breast tumors. This is a statement about group averages, not about every individual tumor or cell.

Implications

This means that average cell size could serve as a quantitative metric for the rapid and automated classification of tumor malignancy.

References

“Normal and Cancer Cells Structure: Image Details.” NCI Visuals Online, National Cancer Institute, (2001). visualsonline.cancer.gov/details.cfm?imageid=2512.

Street, W.N., Wolberg, W.H., & Mangasarian, O.L. “Nuclear feature extraction for breast tumor diagnosis.” (1993) Proc. SPIE 1905: Biomedical Image Processing and Biomedical Visualization. https://doi.org/10.1117/12.148698

Wolberg, W., Mangasarian, O., Street, N., & Street, W. “Breast Cancer Wisconsin (Diagnostic)” (1993) UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B